Abstract

We revisit the idea of brain damage, i.e. the pruning of the
coefficients of a neural network, and suggest how brain damage can be
modified and used to speed up convolutional layers in ConvNets. The
approach uses the fact that many efficient implementations reduce
generalized convolutions to matrix multiplications. The suggested
brain damage process prunes the convolutional kernel tensor in a
group-wise fashion. After such pruning, convolutions can be reduced to
multiplications of thinned dense matrices, which leads to speedup. We
investigate different ways to add group-wise pruning to the learning
process, and show that several-fold speedups of convolutional layers
can be attained using group-sparsity regularizers. Our approach can
adjust the shapes of the receptive fields in the convolutional layers,
and even prune excessive feature maps from ConvNets, all in a
data-driven way.


1. Introduction 

In the original Optimal Brain Damage work [32] of 25 years ago, LeCun
et al. observed that a carefully designed “brain-damage” process can
sparsify the coefficients of a multi-layer neural network very
significantly while incurring minimal or no loss of prediction
accuracy. Such a process resembles the biological learning processes
in mammals, in whose brains the number of synapses peaks during early
childhood and is then reduced substantially in the process of synaptic
pruning [?]. The optimal brain damage algorithm and its variants,
however, impose sparsity in an unstructured way. As a result, while a
large number of parameters can be pruned, the attained level of
sparsity in the network is usually insufficient to achieve substantial
computational speedup.

These days, due to the overwhelming success of very big convolutional
neural networks (ConvNets) [30] on a variety of machine learning
problems, the task of speeding up ConvNets has become a topic of
active research and engineering. Generalized convolution, i.e. the
operation of convolving a 4D kernel tensor with the stack of input
maps in order to produce the stack of output maps, is at the core of
ConvNets and also represents their speed bottleneck. Here, we present
a simple approach that modifies the standard generalized convolution
process by imposing structured “brain-damage” on the kernel tensor. We
demonstrate that a considerable speed-up of ConvNets can be obtained
for a certain structure.

This structure is motivated by the observation that the majority of
current implementations of generalized convolutions (including the
most efficient one at the time of submission) [10, 16, 26, 12, 45, 37]
compute generalized convolutions by reducing them to matrix
multiplications (this reduction is also referred to as lowering,
unrolling, or the im2col operation). While unstructured brain damage
in a convolutional layer, i.e. shrinking some of the coefficients of
the convolutional kernel tensor to zero, will make one of the factor
matrices (the filter matrix) sparse, it will not make the overall
multiplication run faster. Our idea therefore is to group together the
entries of the convolutional tensor in a certain fashion and to shrink
such groups to zero in a coordinated way. By doing this, we can
eliminate rows and columns from both factor matrices that are
multiplied when convolution is reduced to matrix multiplication.
Repeated elimination of rows and columns makes both factor matrices
thinner (but still dense) and results in faster matrix multiplication.

We demonstrate that a conventional group-sparsity regularizer [47]
embedded into stochastic gradient descent minimization is able to
accomplish group-wise brain damage efficiently. The use of group
sparsity thus allows us to optimize receptive fields in the
convolutional network. Our approach therefore makes the case for the
natural idea of using structured sparsity as a simple way to optimize
connectivity in deep architectures.

In the experiments, we show that a carefully designed group-wise brain
damage procedure can sparsify existing neural networks considerably.
In particular, the speed-up factors exceed those obtained by recent
tensor-factorization based methods. E.g., we show that group-wise
brain damage can accelerate the bottleneck layers of AlexNet (’conv2’
and ’conv3’) by a factor of 8.5x simultaneously, while incurring only
a modest (1%) loss of prediction accuracy.

2. Related work 

As ConvNets are growing in size and are spreading towards real-time
and large-scale computer vision systems, a lot of attention is being
paid to the problem of speeding up convolutional layers. In parallel
to the lowering-based approaches mentioned above, which reduce
convolutions to matrix multiplications, several works investigate the
use of fast Fourier transforms [?]. Despite the theoretical appeal,
the use of Fourier transforms has its own limitations (mostly related
to memory usage) and most existing packages stick to the lowering
approach, which at the moment of the submission is also used by the
fastest implementation [37].

Alternatively, several recent works investigate various kinds of
tensor factorization in order to break generalized convolution into a
sequence of smaller convolutions with fewer parameters [21, 15, 29].
Using inexact low-rank factorizations within such approaches allows
one to obtain considerable speedup when a low enough decomposition
rank is used. Our approach is related to tensor-factorization
approaches as we also seek to replace the full convolution tensor with
a tensor that has fewer parameters. Our approach however does not
perform any sort of decomposition/factorization of the kernel tensor.
Another more distantly related approach is represented by a group of
methods [?] that compress the initial large ConvNet into a smaller
network with a different architecture while trying to match the
outputs of the two networks.

Our approach is also related to methods that use structured sparsity
[?] to discover optimal architectures of certain machine learners,
e.g. to discover the optimal structure of a graphical model [22] or
the optimal receptive fields in the two-layered image classifier [25].
On the other hand, since our approach effectively learns receptive
fields within a ConvNet, it can be related to other receptive field
learning approaches, e.g. [?].

The combination of sparsity and deep learning has been investigated
within several unsupervised approaches such as sparse autoencoders [?]
and sparse deep belief networks [33]. We also note two reports that
use some form of sparsification of deep feedforward networks and
appeared in recent months as we were developing our approach.
Similarly to [32], the work [14] uses sparsification to reduce the
number of parameters in the memory-bound scenario. Their goal is thus
to save memory rather than to attain acceleration. In the report of
[11], the output of the convolution is computed at a sparsified set of
locations with the gaps being filled by interpolation. This approach
does not sparsify the convolutional kernel and is therefore different
from the group-wise brain damage approach we suggest here.

Our work focuses on the task of speeding up convolutional layers (as
they represent the speed bottleneck) and is therefore complementary to
approaches that focus on the reduction of the size/memory footprint of
fully-connected layers [?, 18, 38, 42, 46].

3. Group-Sparse Convolutions 

Below, we discuss the reduction from generalized convolution to matrix
multiplication [9] and introduce the notation along the way. We then
explain the group-sparse convolution idea. Generalized convolution
within a convolutional layer transforms an input stack of $S$ maps of
size $W' \times H'$, which can be treated as a three-dimensional
tensor (array) $U(w,h,s)$, into an output stack of $T$ maps of size
$W'' \times H''$, which form a three-dimensional tensor $V(w,h,t)$.
The exact relation between $W', H'$ and $W'', H''$ depends on the
padding and stride settings within the layer, and our approach can
handle any padding/striding settings seamlessly. The transformation is
defined by the following formula:

$$V(x,y,t) = \sum_{s=1}^{S} \sum_{i=1}^{d} \sum_{j=1}^{d} K(i,j,s,t) \cdot U\left(x+i-\tfrac{d+1}{2},\; y+j-\tfrac{d+1}{2},\; s\right) \tag{1}$$

Here, $K$ is a four-dimensional kernel tensor of size
$d \times d \times S \times T$ with the first two dimensions
corresponding to the spatial dimensions, the third dimension
corresponding to input maps, and the fourth dimension corresponding to
output maps. The spatial width and height of the kernel are denoted as
$d$ (for simplicity, we assume square-shaped kernels and odd $d$).

The implementation of (1) constitutes the speed bottleneck for
ConvNets. In [10], it was suggested to reduce the computation of all
entries of $V$ to the multiplication of two large and dense matrices.
The reduction allows one to use highly optimized implementations of
dense matrix multiplication (e.g. variants of BLAS [6] libraries) that
have been developed over many years for all possible computing
architectures. The reduction proceeds as follows:

• The kernel tensor $K$ is reshaped into the filter matrix $F$ of size
$T \times d^2 S$, where the $t$-th row corresponds to a sequence of
$S$ 2D filters $K(:,:,s,t)$ reshaped in a row-wise fashion into row
vectors.

Figure 1: Standard Generalized Convolution (top) vs. Generalized
Convolution after Group-wise Brain Damage (bottom). In both cases, we
show the diagram for two input maps ($S = 2$, blue-green color
coding). We highlight three output maps $t_1$, $t_2$, $t_3$
color-coded red-orange-yellow, and we also highlight two spatial
locations $l_1$ and $l_2$. In both cases, the output map stack is
obtained by reshaping the product of the filter matrix and the patch
matrix. In the standard case, the filters and the patches sampled
during the formation of the patch matrix are dense. After group-wise
brain damage, both the filters and the patch sampling patterns are
group-sparse (one sparsity pattern per input map), which results in
much thinner filter and patch matrices and thus leads to much faster
matrix multiplication/convolution.


• The input map stack $U$ is reshaped into the patch matrix $P$ of
size $d^2 S \times W''H''$, where the $l$-th column corresponds to a
certain output location $l = (x,y)$ and is stacked from the $S$
patches extracted from the $S$ input maps, all centered at this
location and reshaped in a row-wise fashion into column vectors.

• The filter matrix $F$ is multiplied by the patch matrix $P$,
resulting in an output matrix of size $T \times W''H''$ that contains
all the elements of $V$ (each column corresponds to a certain location
and contains the values at this location in the $T$ output maps). The
multiplication implements (1) exactly, as each row-by-column product
within the multiplication corresponds to one instance of the
computation (1) for a certain $(x,y,t)$. The output tensor (map stack)
$V$ can be obtained from this output matrix by reshaping.
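
The three steps of the reduction can be illustrated with a minimal NumPy sketch. It assumes stride 1, zero padding of (d-1)/2 pixels (so that W'' = W' and H'' = H') and zero-based array indices; the function names `conv_direct` and `conv_lowered` are hypothetical, and the code is meant to show the shapes of F and P rather than to reproduce any particular library's im2col.

```python
import numpy as np

def conv_direct(U, K):
    """Direct evaluation of Eq. (1); U is (W, H, S), K is (d, d, S, T)."""
    W, H, S = U.shape
    d, _, _, T = K.shape
    p = d // 2                                   # (d - 1) / 2 for odd d
    Up = np.pad(U, ((p, p), (p, p), (0, 0)))     # zero padding (assumption)
    V = np.zeros((W, H, T))
    for x in range(W):
        for y in range(H):
            patch = Up[x:x + d, y:y + d, :]      # d x d x S window centred at (x, y)
            V[x, y, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return V

def conv_lowered(U, K):
    """Same result via one dense product of the filter and patch matrices."""
    W, H, S = U.shape
    d, _, _, T = K.shape
    p = d // 2
    Up = np.pad(U, ((p, p), (p, p), (0, 0)))
    # Filter matrix F: T x (d*d*S); row t concatenates the S filters K[:, :, s, t].
    F = K.transpose(3, 2, 0, 1).reshape(T, S * d * d)
    # Patch matrix P: (d*d*S) x (W*H); one column per output location.
    P = np.empty((S * d * d, W * H))
    for x in range(W):
        for y in range(H):
            patch = Up[x:x + d, y:y + d, :]
            P[:, x * H + y] = patch.transpose(2, 0, 1).reshape(-1)
    # Reshape the T x (W*H) product back into the output map stack.
    return (F @ P).reshape(T, W, H).transpose(1, 2, 0)
```

Both functions return an array of shape (W, H, T); the lowered version spends all of its arithmetic in the single product `F @ P`, which is the part a BLAS library accelerates.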

The construction discussed above has proven to be highly successful
and is used in the majority of modern ConvNet “backends”, e.g.
[10, 16, 26, 12, 45, 37]. Our key idea is to train ConvNets with
sparse convolutional kernels that are consistent with this
construction.

Such consistency can be achieved if the sparsity patterns are aligned
in a certain way. Formally, group-wise brain damage introduces a
sparsity pattern $Q_s$ for every input map $s \in 1 \dots S$. The
sparsity pattern is defined as a subset of the full spatial $d$-by-$d$
grid, i.e. $Q_s \subset \{1 \dots d\} \otimes \{1 \dots d\}$. The
convolutional operation then becomes a slight modification of (1):

$$V(x,y,t) = \sum_{s=1}^{S} \sum_{(i,j) \in Q_s} K(i,j,s,t) \cdot U\left(x+i-\tfrac{d+1}{2},\; y+j-\tfrac{d+1}{2},\; s\right) \tag{2}$$

The reduction of (2) is an almost straightforward replication of the
procedure of [10]. The only modifications are (Figure 1):

• When the filter matrix is assembled, each 2D filter $K(:,:,s,t)$ is
reshaped into a row vector of length $|Q_s|$ by including only the
non-zero elements. The filter matrix thus becomes of size
$T \times \sum_{s=1}^{S} |Q_s|$.

Figure 2: Relative speed-up (with error bars over 100 runs) versus
density level r, measured for forward propagation in the second
convolutional layer of LeNet on CPU. Importantly, the observed speedup
is almost linear in the sparsity level (diagonal).


rectangular pattern all the way to one-by-one (this is in line with a
recent work of [?] where they consider 2x2 filters for some of their
architectures). Note that with our approach we are free to choose
non-rectangular filters, and in the experiments we found this very
useful.

One of the downsides of this approach is that when designing an
architecture with multiple convolution layers, there are no clear
design principles that can guide the choice of the filter shapes. In
contrast, the methods discussed below can start with larger filters
and then shrink their sizes towards optimally-shaped small filters.

Training with group-sparsity regularizer. Rather than fixing the
group-sparsity pattern in advance, it is possible to find it as a part
of the learning process while the network is trained. A classical way
to achieve this is through the use of group-sparsity regularization
[47, 41, 24]. Thus, we consider a regularizer based on the
$\ell_{2,1}$-norm:


• When the patch matrix is assembled, each 2D patch at location
$l = (x,y)$ in map $s$ is reshaped into a column vector of size
$|Q_s|$ by sampling the input map $U(:,:,s)$ sparsely at locations
$(x+i-\tfrac{d+1}{2},\; y+j-\tfrac{d+1}{2})$, where $(i,j) \in Q_s$.
The patch matrix thus becomes of size
$\sum_{s=1}^{S} |Q_s| \times W''H''$.

As a result of this modification, the multiplication of two dense
matrices of sizes $T \times d^2 S$ and $d^2 S \times W''H''$ is
replaced by the multiplication of two dense matrices of sizes
$T \times \sum_{s=1}^{S} |Q_s|$ and
$\sum_{s=1}^{S} |Q_s| \times W''H''$, which results in a
$d^2 S / \sum_{s=1}^{S} |Q_s|$-fold reduction in the number of scalar
operations. In our experiments with the reference implementation of
[27], the wall-clock reduction in the convolution time between the
original implementation and the group-sparse convolution almost
matched the “theoretical” speed-up factor (Figure 2).
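
The thinned product can be sketched in NumPy as follows. This is an illustration under stated assumptions (stride 1, zero padding, zero-based kernel positions, at least one retained position overall), with hypothetical names `group_sparse_conv`, `masked_reference` and `theoretical_speedup`; it is not the modified im2col layer used in our experiments.

```python
import numpy as np

def group_sparse_conv(U, K, Q):
    """Eq. (2) via thinned filter and patch matrices.

    U: (W, H, S) input stack; K: (d, d, S, T) kernel tensor;
    Q: list of S lists of zero-based (i, j) kernel positions kept for map s.
    """
    W, H, S = U.shape
    d, _, _, T = K.shape
    p = d // 2
    Up = np.pad(U, ((p, p), (p, p), (0, 0)))
    # Thinned filter matrix: T x sum_s |Q_s| (one column per retained group).
    F = np.stack([K[i, j, s, :] for s in range(S) for (i, j) in Q[s]], axis=1)
    # Thinned patch matrix: sum_s |Q_s| x (W*H); each row is input map s
    # sampled at the kernel offset (i, j) for every output location.
    P = np.stack([Up[i:i + W, j:j + H, s].reshape(-1)
                  for s in range(S) for (i, j) in Q[s]], axis=0)
    return (F @ P).reshape(T, W, H).transpose(1, 2, 0)

def masked_reference(U, K, Q):
    """Reference: accumulate Eq. (2) term by term over the retained groups."""
    W, H, S = U.shape
    d, _, _, T = K.shape
    p = d // 2
    Up = np.pad(U, ((p, p), (p, p), (0, 0)))
    V = np.zeros((W, H, T))
    for s in range(S):
        for (i, j) in Q[s]:
            V += Up[i:i + W, j:j + H, s][:, :, None] * K[i, j, s, :]
    return V

def theoretical_speedup(d, Q):
    """d^2 S / sum_s |Q_s|, the reduction in scalar multiplications."""
    return d * d * len(Q) / sum(len(q) for q in Q)
```

For example, with d = 3 and S = 2, a cross-shaped pattern on the first map (5 positions) and a single central position on the second map retain 6 of the 18 groups, so the predicted reduction is 18/6 = 3.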

4. Fast ConvNets with Group-sparse convolutions

We consider two different scenarios that obtain fast ConvNets with
group-sparse convolutions. First, we consider training such networks
from scratch, and secondly we consider obtaining such networks by
modification of pretrained architectures (i.e. performing “brain
damage”).


$$\Omega_{2,1}(K) = \lambda \sum_{i,j,s} \|\Gamma_{ijs}\| = \lambda \sum_{i,j,s} \sqrt{\sum_{t=1}^{T} K(i,j,s,t)^2} \tag{3}$$

where the vector $\Gamma_{ijs}$ denotes the group of kernel tensor
entries $K(i,j,s,:)$. The effect of the regularizer (3) is in
shrinking some of such groups to zero in a coordinated fashion. When
an entire group $\Gamma_{ijs}$ is set to zero, one can set the pixel
$(i,j)$ in the sparsity pattern $Q_s$ to zero, thus increasing the
group-sparsity.

For a convolutional layer that is being sparsified, the gradient of
(3), i.e.:

$$\frac{\partial \Omega_{2,1}(K)}{\partial K(i,j,s,t)} = \lambda \frac{K(i,j,s,t)}{\sqrt{\sum_{z=1}^{T} K(i,j,s,z)^2}} \tag{4}$$

can simply be added to the gradient of the learning loss while
performing stochastic gradient updates in the course of learning. The
coefficient $\lambda$ in (3) and (4) controls the strength of the
regularization w.r.t. the main learning loss.

Generally, using the regularizer (3) will result in a group-sparsified
kernel tensor with some of the $\Gamma_{ijs}$ having only near-zero
entries. Because of the stochastic nature of SGD and the
non-differentiability of the $\ell_{2,1}$-norm at zero, the entries in
these groups will not be exactly zero, and further postprocessing is
needed to nullify the near-zero groups and to set the sparsity
patterns $Q_s$ accordingly.
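
A minimal NumPy sketch of the regularizer (3), its gradient (4) and the postprocessing step is given below. The function names, the guard `tiny` against division by zero and the tolerance `tol` are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def omega21(K, lam):
    """Eq. (3): lam times the sum of the norms of the groups K[i, j, s, :]."""
    return lam * np.sqrt((K ** 2).sum(axis=3)).sum()

def omega21_grad(K, lam, tiny=1e-12):
    """Eq. (4); an all-zero group receives a zero (sub)gradient here."""
    norms = np.sqrt((K ** 2).sum(axis=3, keepdims=True))
    return lam * K / np.maximum(norms, tiny)

def nullify_small_groups(K, tol):
    """Postprocessing: zero out groups with norm below tol (hypothetical tolerance).

    Returns the cleaned kernel and a boolean (d, d, S) mask whose slice
    mask[:, :, s] plays the role of the sparsity pattern Q_s.
    """
    mask = np.sqrt((K ** 2).sum(axis=3)) >= tol
    return K * mask[:, :, :, None], mask
```

In a training loop, `omega21_grad(K, lam)` would be added to the gradient of the learning loss with respect to `K` before each SGD update.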


4.1. Training from scratch 

Predefined group-sparsity pattern. The simplest solution that we
consider is to choose the sparsity patterns $Q_s$ in advance in a
data-independent manner, and enforce these patterns during the
learning of the network. One particular case of this approach is
simply reducing the spatial size of filters to a minimum, e.g.
three-by-three, or even smaller


4.2. Sparsifying with Group-wise Brain Damage 

While it is possible to train ConvNets with group-sparse convolutions
from scratch, the main focus of our paper is developing algorithms
that can speed up existing pretrained networks that often take
excessive time for training. Towards this end, we have developed two
approaches that can accelerate pretrained networks by inflicting
group-wise brain damage in such a way that the drop in the prediction
accuracy is kept small. In both cases, we assume that we have access
to the training dataset $D$ that the model was trained on.

Group-wise sparsification with fine-tuning. Our first implementation
is also based on the group-sparse regularizer (3). We start with the
input ConvNet and run the learning process on the dataset $D$ with the
added regularizer (3). After a certain number of iterations, a
predefined number of groups $\Gamma_{ijs}$ with the smallest
$\ell_2$-norm is set to zero. For a desired density level
$r \in [0,1]$ and respective speedup $1/r$, we set $d^2 S (1 - r)$
groups to zero, making the respective $Q_s$ sparse.
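
The pruning step for a single layer can be sketched as follows. The function name `prune_to_density` and the rounding of $d^2 S (1 - r)$ to the nearest integer are assumptions made for this illustration.

```python
import numpy as np

def prune_to_density(K, r):
    """Zero the d*d*S*(1 - r) groups K[i, j, s, :] with the smallest l2-norm.

    K: (d, d, S, T) kernel tensor; r: desired density level in [0, 1].
    Returns the pruned kernel and a boolean (d, d, S) mask of surviving
    groups, whose slice mask[:, :, s] corresponds to the pattern Q_s.
    """
    d, _, S, T = K.shape
    norms = np.sqrt((K ** 2).sum(axis=3)).reshape(-1)
    n_zero = int(round(d * d * S * (1.0 - r)))     # rounding is an assumption
    keep = np.ones(d * d * S, dtype=bool)
    keep[np.argsort(norms)[:n_zero]] = False       # drop the smallest groups
    mask = keep.reshape(d, d, S)
    return K * mask[:, :, :, None], mask
```

With d = 5, S = 4 and r = 0.2, for instance, 80 of the 100 groups are removed and the expected speedup of the layer is 1/r = 5.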

We have found two complications with this approach. Firstly, for a
given density $r$ it was generally hard to set an appropriate
regularization strength $\lambda$ in advance without trying several
values. Secondly, for small $r$ (large speedup) the appropriate
regularization strength $\lambda$ typically leads to excessive
regularization, as many groups end up being biased towards zero but
not close to zero. Because of that, the prediction accuracy for such
$\lambda$ experienced a significant drop in the process of learning as
compared to the input ConvNet.

Fortunately, one can recover from most of this drop by the subsequent
fine-tuning of the network that follows the brain-damage process. For
the fine-tuning, we fix the sparsity patterns $Q_s$ and restart
learning without group-sparse regularization. We then train for a
large number of epochs. As a result of such fine-tuning, the network
adapts to the imposed sparsity patterns, while the prediction accuracy
goes up and recovers most of the drop.

Gradual group-wise sparsification. To avoid the two 
complications discussed above we developed an alternative 
approach that essentially combines the brain-damage and 
the fine-tuning processes, and furthermore avoids most of 
the need for manual search for good meta-parameter values. 
The approach also often leads to considerably better results. 

In this approach, we consider the truncated $\ell_{2,1}$ regularizer:

$$\Omega^{\theta}_{2,1}(K) = \lambda \sum_{i,j,s} \min(\|\Gamma_{ijs}\|, \theta) \tag{5}$$

The gradient of (5) equals (4) when $\|\Gamma_{ijs}\| < \theta$ and is
zero otherwise. Informally speaking, the value of $\theta$ controls
which groups are considered “promising” and are being shrunk towards
zero, and which groups are considered to be too far from zero and
therefore stay unaffected by the regularizer (5).
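
The truncated regularizer and its gradient can be sketched in NumPy as follows; the function names and the guard `tiny` against division by zero are assumptions made for illustration.

```python
import numpy as np

def truncated_omega21(K, lam, theta):
    """Eq. (5): each group norm is capped at theta before summation."""
    norms = np.sqrt((K ** 2).sum(axis=3))
    return lam * np.minimum(norms, theta).sum()

def truncated_omega21_grad(K, lam, theta, tiny=1e-12):
    """Gradient of Eq. (5): Eq. (4) for groups with norm below theta, else zero."""
    norms = np.sqrt((K ** 2).sum(axis=3, keepdims=True))
    active = norms < theta                       # the "promising" groups
    return np.where(active, lam * K / np.maximum(norms, tiny), 0.0)
```

Groups whose norm already exceeds `theta` contribute the constant `lam * theta` to the value and nothing to the gradient, so they are left to the learning loss alone.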

To perform brain-damage, we then create a validation set on which we
monitor the performance of the network. We choose the maximum drop
$\delta$ of the prediction accuracy on the validation set that we are
willing to tolerate. We then start with an input ConvNet and perform
learning with the regularizer (5) while varying $\theta$.
Specifically, after each epoch we monitor the performance of the
network on a hold-out set and increase $\theta$ (intensifying brain
damage) if the accuracy drop is less than $\delta$ and decrease
$\theta$, thus relieving certain groups from the effect of the
regularizer, if the drop is greater than $\delta$.

Figure 3: Accuracy vs. density level on the MNIST dataset (LeNet
architecture) for various ConvNets with group-sparse convolutions. We
compare the results obtained by training with $\ell_{2,1}$ and
$\ell_1$ regularizations followed by sparsification, as well as
training with predefined sparsity patterns $Q_s$ (black dots).
Overall, training with the $\ell_{2,1}$ regularizer obtains the best
result, which can be further improved by fine-tuning without
regularization.

To perform the actual sparsification, we also introduce an additional
threshold $\epsilon \ll \theta$. In the process of learning, when the
norm of a certain group falls below the threshold (i.e.
$\|\Gamma_{ijs}\| < \epsilon$) the group is greedily fixed to zero and
eliminated from the tensor. The sparsity thus monotonically increases
through the process, and we carry on training until the sparsification
process stalls, i.e. the system keeps training with $\theta$ and the
performance drop oscillating, while no new groups have their norms
fall under $\epsilon$ for a number of epochs. In our experiments, all
increments and decrements of $\theta$ were based on five-percent
quantiles of the groups, i.e. every time $\theta$ is adjusted, we set
$\theta$ to bring 5% of the groups $\Gamma_{ijs}$ in or out of the
$\|\Gamma_{ijs}\| < \theta$ “territory”.
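
The per-epoch bookkeeping of this procedure can be sketched with a few NumPy helpers. The names below are hypothetical, and computing the quantile over the surviving (not yet eliminated) groups only is an assumption, since the text does not say which groups enter the quantile.

```python
import numpy as np

def group_norms(K):
    """l2-norms of the groups K[i, j, s, :] as a (d, d, S) array."""
    return np.sqrt((K ** 2).sum(axis=3))

def adjust_quantile(q, drop, delta, step=0.05):
    """Raise the fraction of groups under theta if the accuracy drop is
    below delta (more brain damage), lower it otherwise."""
    q = q + step if drop < delta else q - step
    return float(np.clip(q, 0.0, 1.0))

def theta_from_quantile(K, alive, q):
    """theta such that a fraction q of the surviving groups has a smaller norm.

    Assumes at least one group is still alive.
    """
    return float(np.quantile(group_norms(K)[alive], q))

def freeze_small_groups(K, alive, eps):
    """Greedily fix to zero the groups whose norm fell below eps."""
    alive = alive & (group_norms(K) >= eps)
    return K * alive[:, :, :, None], alive
```

After each epoch one would call `adjust_quantile` with the measured accuracy drop, recompute `theta` with `theta_from_quantile`, and apply `freeze_small_groups`; training stops once `alive` has not changed for a number of epochs.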

Overall, we found the whole procedure to be rather insensitive to the
choices of $\lambda$ and $\epsilon$, and overall to be more practical
and to lead to higher group-sparsity and speed-ups than those
attainable by the sparsification with fine-tuning approach. Most
importantly, we could use the same $\lambda$ and $\epsilon$, as well
as the same shared value of $\theta$, when sparsifying multiple layers
simultaneously.

5. Experiments 

Implementation details. Our implementation is based on Caffe [?] and
modifies their original convolution, which is implemented as two
subsequent layers (the im2col layer that forms the patch matrix and
the multiplication layer). To implement the group-sparse convolution
we focused on the forward propagation step and CPU computation. Most
of our methods can be extended to the backprop step and to GPUs;
however, making such extensions efficient is non-trivial. For our
purpose, we only needed to modify the im2col layer, so that it can
fill in the patch matrix while following certain sparsity patterns.

Method                                                      Density  Speed-up  Accuracy drop

Accelerating the second convolutional layer of AlexNet
Denton et al. [15]: Tensor decomposition + Fine-tuning      -        2.7x      ~1%
Lebedev et al. [29]: CP-decomposition + Fine-tuning         -        4.5x      ~1%
Jaderberg et al. [21]: Tensor decomposition + Fine-tuning   -        6.6x      ~1%
Training with fixed sparsity patterns                       0.12     8.33x     0.82%
Training with fixed sparsity patterns                       0.2      5x        0.16%
Group-wise sparsification + Fine-tuning                     0.1      10x       1.13%
Group-wise sparsification + Fine-tuning                     0.2      5x        0.43%
Group-wise sparsification + Fine-tuning                     0.3      3.33x     0.11%
Group-wise sparsification + Fine-tuning                     0.4      2.5x      -0.09%
Gradual group-wise sparsification                           0.11     9.0x      0.28%
Gradual group-wise sparsification                           0.05     20x       1.07%

Accelerating the second and the third convolutional layers of AlexNet
Training with fixed sparsity patterns                       0.12     8.7x      1.54%
Training with fixed sparsity patterns                       0.35     2.9x      0.36%
Training with fixed sparsity patterns                       0.54     1.9x      -0.53%
Group-wise sparsification + Fine-tuning                     0.2      5x        1.50%
Group-wise sparsification + Fine-tuning                     0.3      3.33x     1.17%
Group-wise sparsification + Fine-tuning                     0.5      2x        0.57%
Gradual group-wise sparsification                           0.12     8.5x      1.04%

Accelerating all five convolutional layers of AlexNet
Training with fixed sparsity patterns                       0.34     3.0x      1.34%
Gradual group-wise sparsification                           0.31     3.2x      1.43%

Table 1: Accelerating convolutional layers of the pretrained AlexNet
architecture: results of the variants of our method for various
sparsity levels alongside tensor-decomposition based methods (note:
the results for [21] are reproduced from [29]).

Datasets. We perform the following experiments. Firstly, we consider a
small-scale setting, and compare training ConvNets with group-wise
brain damage from scratch with baselines. We use the MNIST dataset [?]
for these small-scale experiments. We then consider a large-scale
problem, namely ImageNet (ILSVRC) image classification and the task of
accelerating a pretrained architecture, namely the Caffe version of
AlexNet [28]. We also give preliminary results for one of the VGGNet
networks [43].

5.1. MNIST experiments 

We trained the LeNet architecture on the MNIST dataset from random
initialization while adding the group-sparse regularization (Section
4.1), varying the regularization strength $\lambda$ and picking the
optimal one for each sparsity level. The sparsification affects both
convolutional layers of LeNet, and the same density level $r$ is
enforced in both layers. We also consider a number of baselines:

• A simple baseline that trains the network without regularization and
then simply eliminates (sets to zero) a certain number of groups
$\Gamma_{ijs}$ with the smallest $\ell_2$-norms. The performance of
this baseline was clearly below all other methods and it is not
reported.

• Picking sparsity patterns $Q_s$ in advance. We consider filters with
only one central non-zero entry and filters with two adjacent central
non-zero elements. These options correspond to densities of 4% and 8%
respectively. The former is essentially equivalent to a
non-convolutional network.

• We also consider a simpler non-group-wise sparsification by training
with an $\ell_1$-norm regularizer (with varying $\lambda$) but then
nullifying groups $\Gamma_{ijs}$ based on their norms.
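
The densities quoted for the predefined patterns are consistent with 5x5 filters in both convolutional layers. The filter size is not stated in the text and is assumed here; it matches the standard LeNet configuration. Under that assumption, with $d^2 = 25$ positions per filter and the same pattern in every input map:

$$r_{\text{one entry}} = \frac{1}{25} = 0.04, \qquad r_{\text{two entries}} = \frac{2}{25} = 0.08,$$

which correspond to theoretical speed-up factors of $1/r = 25$ and $1/r = 12.5$ for the convolutional layers.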

The results of the proposed method and the baselines are shown in
Figure 3. The rightmost plot shows the comparison of the
$\ell_1$-envelope, the $\ell_{2,1}$-envelope, and the performance of
the group-wise brain damage applied to the network trained without
sparsity-inducing regularizers. The use of group-sparsity
regularization boosts the performance of group-wise brain damage very
considerably. Twenty-fold acceleration of convolutional layers can be
obtained while keeping the error low (2.1%, reduced to 1.71% after
fine-tuning). Using an $\ell_1$-regularizer followed by optimal brain
damage works worse than the $\ell_{2,1}$-regularizer. Pre-fixing
sparsity patterns achieves good results, which are still worse than
training with the group-sparsity regularizer. Note also that all
methods except the baseline with the pre-fixed patterns can be
improved via fine-tuning.

5.2. ILSVRC experiments 

We first consider the AlexNet (Caffe reimplementation) architecture
that has five convolutional layers. We consider the following
subtasks: (i) accelerating the second convolutional layer (which is
the slowest of all layers), (ii) accelerating the second and the third
layers (which are the two slowest layers), (iii) accelerating all five
convolutional layers (which together take the vast majority of the
forward-propagation time). When reporting the final density in
subtasks (ii) and (iii), we weigh the densities in different layers by
the forward propagation times.
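
The time-weighted density can be computed as in the short sketch below. The function name `time_weighted_density` is hypothetical, and a plain weighted average is assumed, since the exact weighting formula is not spelled out in the text.

```python
def time_weighted_density(densities, times):
    """Average per-layer densities, weighted by the forward-propagation
    time of each layer in the original (dense) network."""
    total = sum(times)
    return sum(r * t for r, t in zip(densities, times)) / total
```

For two layers with illustrative (made-up) densities 0.1 and 0.3 and dense forward times in the ratio 3:1, the reported density would be (0.1 * 3 + 0.3 * 1) / 4 = 0.15.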

We focused on accelerating the existing network from the Caffe zoo
(Table 1). It is not clear how initializing the network with
pretrained weights as opposed to a random initialization affects the
final accuracy, but it allowed us to shorten training time in many
cases, which is critical in the case of large networks. We evaluate
the variant of our method that trims the network according to some
predefined sparsity pattern, and then learns the network while keeping
the same fixed pattern. Namely, we consider the following symmetric
centered patterns: a vertical or horizontal 1x3 block, the 3x3 cross
pattern, and a 3x3 square or diamond shape inside a 5x5 filter.

For the first two subtasks, we evaluated the variant of our method
with the sparsity-inducing regularizer for various sparsity levels.
For several desired density levels $r$ we searched for the optimal
$\lambda$ through a large range with ten-fold increments. For each $r$
we pick the $\lambda$ that results in the minimal accuracy drop after
sparsification before fine-tuning. After picking the optimal
$\lambda$, we perform fine-tuning. Figure 4 demonstrates the sparsity
patterns obtained for different sparsity levels.

Finally, for all three subtasks we evaluated the most advanced of our
methods, namely gradual group-wise sparsification. We set the
parameters $\lambda$ and $\epsilon$ to 0.01 and 0.1 respectively. We
split the test set of ILSVRC randomly into two halves and use one of
the halves solely to estimate the drop of the classification accuracy
in the dynamical adjustment of $\theta$. We then report the
performance drop on the other half of the test set. We set the
acceptable performance drop to be 1% of top-1 accuracy.

As shown in Table 1, the results of gradual sparsification outperform
the tensor factorization methods as well as sparsification with
fine-tuning considerably, achieving higher group-sparsification and
speed-up for a similar prediction accuracy drop. Notably, the proposed
approach is more successful in speeding up AlexNet than a number of
approaches based on tensor decomposition. Figure 5 further visualizes
the process of the simultaneous gradual brain damage inflicted on all
five layers of AlexNet.

“External” computer vision task. Convolutional layers of large
networks pretrained on large annotated training sets such as ILSVRC
can be used as universal spatially localized features in a variety of
ways [?], which is particularly valuable for problems with
considerably smaller training sets. Recently, [?] showed that
descriptors obtained by sum-pooling of the features that emerge in the
last convolutional layer of a pretrained network can be used as
state-of-the-art holistic descriptors for image retrieval. We followed
their approach (which includes PCA whitening and normalization as
postprocessing) to assess the effect of group-sparsification on an
external task. Comparing AlexNet as a base model and the network with
the simultaneous group-sparsification of all convolutional layers from
Table 1 with 3.2x speedup, we have found a negligible drop in
performance for the INRIA Holidays dataset [23] from 0.783 mAP to
0.780 mAP, and a reasonably small drop for the Oxford Buildings
dataset [39] from 0.45 to 0.41.

Preliminary VGGNet results. We have also applied the gradual
group-wise sparsification to the slowest convolutional layer of VGGNet
(the deeper 19-layer version of [43]), starting from its Caffe Zoo
version. The sparsification obtained the density $r = 0.13$ with only
a 0.2% top-1 accuracy drop. Interestingly, unlike the experiments with
AlexNet where we rarely observed empty sparsity patterns $Q_s$ (“dead
feature maps”), in this example such all-zero patterns were present
(29 out of 64), suggesting that this manually designed architecture
contains an excessive number of feature maps in this layer. This
result also suggests that our approach is suitable even for networks
with very small initial filter sizes in convolutional layers (3x3 for
VGGNet).

6. Discussion 

We have presented an approach to speeding up ConvNets that uses the
group-wise brain damage process that sparsifies convolution
operations. The approach takes into account the way generalized
convolutions are reduced to matrix multiplications, and prunes the
entries of the convolution kernel in a group-wise fashion. The exact
sparsity patterns can be learned from data using group-sparsity
regularization. When applied after learning with such regularization
and followed by fine-tuning, group-wise brain damage obtains
state-of-the-art performance for speeding up ConvNets.

(a) sparsity 1 - r = 0.9 (b) sparsity 1 - r = 0.8 (c) sparsity 1 - r = 0.6

Figure 4: The sparsity patterns obtained by group-wise brain damage on
the second convolutional layer of AlexNet for different sparsity
levels. Nonzero weights are shown in white. In general, group-wise
brain damage shrinks the receptive fields towards the center and tends
to make them circular.

Figure 5: The process of sparsification of all five layers in AlexNet.
The left plot shows the monotonic growth of the sparsity levels of the
five convolutional layers as the iterations progress. The middle plot
shows the relative prediction accuracy drop for the current system for
the validation part and for the hold-out test set. Finally, the right
part visualizes the process of the adjustment of the $\theta$
threshold in the truncated $\ell_{2,1}$ regularization. This plot
shows the percentile of groups $\Gamma_{ijs}$ with the $\ell_2$-norm
less than $\theta$. $\theta$ is increased or decreased depending on
whether the performance drop on the validation set is smaller or
greater than 1.2%.

Aside from the practical value, the proposed approach also makes the
case for the use of sparse learning for automated discovery of optimal
network architectures, which is arguably one of the main unsolved
problems in deep learning. In our case, the group-sparse regularizer
allows the model to discover optimal receptive fields (Figure 4). It
is interesting to see that the optimization process decided to shrink
the receptive fields towards the center compared to the full version
(which is consistent with the findings in [43, 19]). Perhaps even more
interesting is to see that, in general, the learning process decided
to make the receptive fields roughly circular. Also, the process
treated AlexNet and VGGNet differently, eliminating entire feature
maps by assigning their sparsity patterns $Q_s$ to zero maps in the
latter case. Note that such elimination brings additional speedup
(since the entire map need not be computed in the previous layer).
Such elimination can be explicitly encouraged within our approach
using hierarchical group-sparsity regularizers [?].